Efficient algorithms for exact hierarchical clustering of huge datasets: Tackling the entire protein space
Abstract
Motivation: UPGMA (average-linkage clustering) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. UPGMA, however, is a complete-linkage method, in the sense that all edges between data points are needed in memory. Due to this prohibitive memory requirement, UPGMA is not scalable to very large datasets.
Results: We present novel memory-constrained UPGMA (MCUPGMA) algorithms. Given a constrained memory size, our algorithm guarantees the exact same UPGMA clustering solution, without explicitly holding all edges in memory. Our algorithms are general and applicable to any dataset. We present a theoretical characterization of the algorithm's efficiency and of its hardness for various data. We show the performance of our algorithm under restricted memory constraints. The presented concepts are applicable to any agglomerative clustering formulation. We apply our algorithm to the entire collection of protein sequences to automatically build a novel evolutionary tree of all proteins using no prior knowledge. We show that the newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4, or single-linkage clustering. The robustness of UPGMA improves significantly on existing methods, especially for multi-domain proteins and for large or divergent families. Our algorithm is scalable to any feasible increase in sequence database sizes.
Availability: The evolutionary tree of all proteins in the entire UniProt set, together with navigation and classification tools, will be made available as part of the ProtoNet service. A C++ implementation of the algorithm, suitable for any type or size of data, is available.
Contact: [email protected]
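To make the memory requirement concrete, the following is a minimal sketch of plain (unconstrained) UPGMA, not the memory-constrained algorithm the abstract presents. The function name `upgma` and the dictionary-based edge store are illustrative assumptions; the point is only that every pairwise distance is held in memory at once, which is what becomes prohibitive for very large datasets.

```python
from itertools import combinations


def upgma(dist):
    """Naive UPGMA on a full symmetric dissimilarity matrix (list of lists).

    Hypothetical helper for illustration only. Returns the merges in order
    as (cluster_a, cluster_b, height), where clusters are frozensets of the
    original point indices.
    """
    n = len(dist)
    clusters = {i: frozenset([i]) for i in range(n)}
    # All n*(n-1)/2 edges are stored explicitly: the O(n^2) memory cost.
    d = {(i, j): dist[i][j] for i, j in combinations(range(n), 2)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        (a, b), height = min(d.items(), key=lambda kv: kv[1])
        size_a, size_b = len(clusters[a]), len(clusters[b])
        merges.append((clusters[a], clusters[b], height))
        # Average-linkage update: size-weighted mean of the old distances.
        new_d = {}
        for c in clusters:
            if c in (a, b):
                continue
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop every edge touching the two merged clusters, then add the new ones.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        merged = clusters.pop(a) | clusters.pop(b)
        clusters[next_id] = merged
        next_id += 1
    return merges
```

For example, `upgma([[0, 2, 4], [2, 0, 8], [4, 8, 0]])` first merges points 0 and 1, then joins point 2 at the average of its distances to them. The dictionary `d` grows quadratically with the number of points, which is the cost a memory-constrained variant has to avoid.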
Similar resources
Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space
MOTIVATION UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. APPLICATION We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any pr...
An improved opposition-based Crow Search Algorithm for Data Clustering
Data clustering is an ideal way of working with a huge amount of data and looking for a structure in the dataset. In other words, clustering is the classification of the same data; the similarity among the data in a cluster is maximum and the similarity among the data in the different clusters is minimal. The innovation of this paper is a clustering method based on the Crow Search Algorithm (CS...
Assessment of the Performance of Clustering Algorithms in the Extraction of Similar Trajectories
In recent years, the tremendous and increasing growth of spatial trajectory data and the necessity of processing and extraction of useful information and meaningful patterns have led to the fact that many researchers have been attracted to the field of spatio-temporal trajectory clustering. The process and analysis of these trajectories have resulted in the extraction of useful information whic...
Graph Clustering by Hierarchical Singular Value Decomposition with Selectable Range for Number of Clusters Members
Graphs have many applications in real-world problems. When we deal with a huge volume of data, analyzing it is difficult or sometimes impossible. In big-data problems, clustering is a useful tool for data analysis. Singular value decomposition (SVD) is one of the best algorithms for graph clustering, but it offers no way to select the number of clusters and the number of members ...
A Clustering Based Location-allocation Problem Considering Transportation Costs and Statistical Properties (RESEARCH NOTE)
Cluster analysis is a useful technique in multivariate statistical analysis. Different types of hierarchical cluster analysis and K-means have been used for data analysis in previous studies. However, the K-means algorithm can be improved using some metaheuristics algorithms. In this study, we propose simulated annealing based algorithm for K-means in the clustering analysis which we refer it a...